App : The name of the app
Category : The category of the app
Rating : The rating of the app in the Play Store
Reviews : The number of reviews of the app
Size : The size of the app
Install : The number of installs of the app
Type : The type of the app (Free/Paid)
Price : The price of the app (0 if it is Free)
Content Rating : The appropriate target audience of the app
Genres: The genre of the app
Last Updated : The date when the app was last updated
Current Ver : The current version of the app
Android Ver : The minimum Android version required to run the app
Step 1 | Setup and Initialization
Importing Necessary Libraries
pip install missingno
Requirement already satisfied: missingno in c:\users\jahna\anaconda3\lib\site-packages (0.5.2)
Note: you may need to restart the kernel to use updated packages.
# Data
import numpy as np
import pandas as pd
from collections import defaultdict
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msn
from wordcloud import WordCloud
# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import GridSearchCV
# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Hide warnings
import warnings
warnings.filterwarnings('ignore')
Loading the Dataset
df = pd.read_csv("googleplaystore.csv")
Step 2 | Initial Data Analysis
Dataset Overview
df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
df.shape
(10841, 13)
df.columns
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
'Android Ver'],
dtype='object')
df.describe()
| | Rating |
|---|---|
| count | 9367.000000 |
| mean | 4.193338 |
| std | 0.537431 |
| min | 1.000000 |
| 25% | 4.000000 |
| 50% | 4.300000 |
| 75% | 4.500000 |
| max | 19.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
Handling Data Types
Most features are stored as the generic `object` dtype and carry suffixes (such as 'M', 'k', '+' and '$'), so each must be converted into a suitable numeric format before analysis.
Reviews
df[~df.Reviews.str.isnumeric()]
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
We could have converted Reviews to integer directly, but this row looks different: its values are shifted one column to the left (Category is missing), so Reviews actually holds the rating. We could repair it by setting Category to NaN and shifting the remaining values right, but for now we simply drop the row.
df=df.drop(df.index[10472])
The feature Reviews must be of integer type.
df["Reviews"] = df["Reviews"].astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int32  
 4   Size            10840 non-null  object 
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int32(1), object(11)
memory usage: 1.1+ MB
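A more defensive alternative (a sketch, not what this notebook does) is to coerce with `pd.to_numeric`, which turns any non-numeric entry into NaN instead of requiring it to be located and dropped by hand:

```python
import pandas as pd

# Hypothetical mini-column mimicking Reviews, including one malformed entry.
reviews = pd.Series(["159", "967", "3.0M"], name="Reviews")

# errors='coerce' converts anything non-numeric to NaN instead of raising.
numeric = pd.to_numeric(reviews, errors="coerce")

print(numeric.isna().sum())  # 1 -- only the malformed '3.0M' entry
```

The NaN rows could then be inspected or dropped in bulk with `numeric.isna()` as a mask.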
Size
df['Size'].unique()
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
'31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
'5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
'1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
'3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
'2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
'7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
'4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
'4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
'23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
'8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
'5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
'6.2M', '18k', '53M', '1.4M', '3.0M', '5.8M', '3.8M', '9.6M',
'45M', '63M', '49M', '77M', '4.4M', '4.8M', '70M', '6.9M', '9.3M',
'10.0M', '8.1M', '36M', '84M', '97M', '2.0M', '1.9M', '1.8M',
'5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '34M', '93M', '65M', '79M',
'100M', '58M', '50M', '68M', '64M', '67M', '60M', '94M', '232k',
'99M', '624k', '95M', '8.5k', '41k', '292k', '11k', '80M', '1.7M',
'74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M',
'71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k',
'899k', '378k', '266k', '375k', '1.3M', '975k', '980k', '4.1M',
'89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k',
'713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k',
'953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k',
'26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k',
'293k', '17k', '74k', '14k', '317k', '78k', '924k', '902k', '818k',
'81k', '939k', '169k', '45k', '475k', '965k', '90M', '545k', '61k',
'283k', '655k', '714k', '93k', '872k', '121k', '322k', '1.0M',
'976k', '172k', '238k', '549k', '206k', '954k', '444k', '717k',
'210k', '609k', '308k', '705k', '306k', '904k', '473k', '175k',
'350k', '383k', '454k', '421k', '70k', '812k', '442k', '842k',
'417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k',
'429k', '192k', '200k', '460k', '728k', '496k', '816k', '414k',
'506k', '887k', '613k', '243k', '569k', '778k', '683k', '592k',
'319k', '186k', '840k', '647k', '191k', '373k', '437k', '598k',
'716k', '585k', '982k', '222k', '219k', '55k', '948k', '323k',
'691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k',
'82k', '208k', '913k', '514k', '551k', '29k', '103k', '898k',
'743k', '116k', '153k', '209k', '353k', '499k', '173k', '597k',
'809k', '122k', '411k', '400k', '801k', '787k', '237k', '50k',
'643k', '986k', '97k', '516k', '837k', '780k', '961k', '269k',
'20k', '498k', '600k', '749k', '642k', '881k', '72k', '656k',
'601k', '221k', '228k', '108k', '940k', '176k', '33k', '663k',
'34k', '942k', '259k', '164k', '458k', '245k', '629k', '28k',
'288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k',
'914k', '903k', '608k', '500k', '54k', '562k', '847k', '957k',
'688k', '811k', '270k', '48k', '329k', '523k', '921k', '874k',
'981k', '784k', '280k', '24k', '518k', '754k', '892k', '154k',
'860k', '364k', '387k', '626k', '161k', '879k', '39k', '970k',
'170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k',
'246k', '73k', '658k', '992k', '253k', '420k', '404k', '470k',
'226k', '240k', '89k', '234k', '257k', '861k', '467k', '157k',
'44k', '676k', '67k', '552k', '885k', '1020k', '582k', '619k'],
dtype=object)
Strip the unit suffixes ('M', 'k') from Size and convert the column to float.
df['Size'] = df['Size'].str.replace('M', '000')   # '19M' -> '19000'; '8.7M' -> '8.7000' (= 8.7)
df['Size'] = df['Size'].str.replace('k', '')      # '201k' -> '201'
df['Size'] = df['Size'].replace("Varies with device", np.nan)
df['Size'] = df['Size'].astype('float')
df['Size']
0 19000.0
1 14000.0
2 8.7
3 25000.0
4 2.8
...
10836 53000.0
10837 3.6
10838 9.5
10839 NaN
10840 19000.0
Name: Size, Length: 10840, dtype: float64
There is a problem: some app sizes are in megabytes and some in kilobytes, and the 'M' replacement left decimal megabyte values (e.g. 8.7) on a different scale from whole ones (e.g. 19000).
###### Bring everything onto a common scale, then convert to megabytes
for i in df['Size']:
    if i < 10:  # assume values below 10 are leftover decimal megabytes; scale them up
        df['Size'] = df['Size'].replace(i, i * 1000)
df['Size'] = df['Size'] / 1000  # back to megabytes
df['Size']
0 19.0
1 14.0
2 8.7
3 25.0
4 2.8
...
10836 53.0
10837 3.6
10838 9.5
10839 NaN
10840 19.0
Name: Size, Length: 10840, dtype: float64
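Note that the `< 10` heuristic above cannot distinguish a leftover decimal megabyte value (8.7 from '8.7M') from a small kilobyte one (8.5 from '8.5k'), so tiny apps can be inflated a thousandfold. A suffix-aware parser avoids this by inspecting the unit before stripping it (a sketch; `parse_size` is a hypothetical helper, and the 1024 kB-per-MB factor is an assumption):

```python
import numpy as np

def parse_size(s):
    """Convert a Play Store size string ('19M', '201k', 'Varies with device') to megabytes."""
    if s == "Varies with device":
        return np.nan
    if s.endswith("M"):
        return float(s[:-1])          # already megabytes
    if s.endswith("k"):
        return float(s[:-1]) / 1024   # kilobytes -> megabytes
    return np.nan                     # anything unrecognised becomes missing

print(parse_size("19M"))   # 19.0
print(parse_size("8.5k"))  # ~0.0083, not 8.5
```

It could be applied in one pass with `df['Size'].map(parse_size)` in place of the replace-and-rescale steps.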
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int32  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(2), int32(1), object(10)
memory usage: 1.1+ MB
Installs and Price
df['Installs'].unique()
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
'50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
'1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
'10+', '1+', '5+', '0+', '0'], dtype=object)
df['Price'].unique()
array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
'$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
'$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
'$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
'$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
'$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
'$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
'$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
'$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
'$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
'$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
'$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
'$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
'$394.99', '$1.26', '$1.20', '$1.04'], dtype=object)
items_to_remove = ['+', ',', '$']
cols_to_clean = ['Installs', 'Price']
for item in items_to_remove:
    for col in cols_to_clean:
        # regex=False: treat '+' and '$' literally, since both are regex metacharacters
        df[col] = df[col].str.replace(item, '', regex=False)
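An equivalent single-pass alternative (a sketch, not the notebook's approach) is one regex character class covering all three symbols:

```python
import pandas as pd

# Hypothetical mini-frame carrying the same suffixes the loop above strips.
demo = pd.DataFrame({"Installs": ["10,000+", "500+"], "Price": ["$4.99", "0"]})

# One character class removes '+', ',' and '$' in a single vectorized pass per column.
for col in ["Installs", "Price"]:
    demo[col] = demo[col].str.replace(r"[+,$]", "", regex=True)

print(demo["Installs"].tolist())  # ['10000', '500']
print(demo["Price"].tolist())     # ['4.99', '0']
```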
df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000 | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
df.Installs.unique()
array(['10000', '500000', '5000000', '50000000', '100000', '50000',
'1000000', '10000000', '5000', '100000000', '1000000000', '1000',
'500000000', '50', '100', '500', '10', '1', '5', '0'], dtype=object)
df['Price'].unique()
array(['0', '4.99', '3.99', '6.99', '1.49', '2.99', '7.99', '5.99',
'3.49', '1.99', '9.99', '7.49', '0.99', '9.00', '5.49', '10.00',
'24.99', '11.99', '79.99', '16.99', '14.99', '1.00', '29.99',
'12.99', '2.49', '10.99', '1.50', '19.99', '15.99', '33.99',
'74.99', '39.99', '3.95', '4.49', '1.70', '8.99', '2.00', '3.88',
'25.99', '399.99', '17.99', '400.00', '3.02', '1.76', '4.84',
'4.77', '1.61', '2.50', '1.59', '6.49', '1.29', '5.00', '13.99',
'299.99', '379.99', '37.99', '18.99', '389.99', '19.90', '8.49',
'1.75', '14.00', '4.85', '46.99', '109.99', '154.99', '3.08',
'2.59', '4.80', '1.96', '19.40', '3.90', '4.59', '15.46', '3.04',
'4.29', '2.60', '3.28', '4.60', '28.99', '2.95', '2.90', '1.97',
'200.00', '89.99', '2.56', '30.99', '3.61', '394.99', '1.26',
'1.20', '1.04'], dtype=object)
df[df['Price']=='Everyone']
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
df['Installs']=df['Installs'].astype('int')
df['Price']=df['Price'].astype('float')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int32  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int32  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(3), int32(2), object(8)
memory usage: 1.1+ MB
Last Updated
Converting the Last Updated column from string to pandas datetime.
Extracting new columns Updated_Month and Updated_Year.
#### Convert Last Updated into a datetime column
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df['Last Updated']
0 2018-01-07
1 2018-01-15
2 2018-08-01
3 2018-06-08
4 2018-06-20
...
10836 2017-07-25
10837 2018-07-06
10838 2017-01-20
10839 2015-01-19
10840 2018-07-25
Name: Last Updated, Length: 10840, dtype: datetime64[ns]
df['Updated_Month']=df['Last Updated'].dt.month
df['Updated_Year']=df['Last Updated'].dt.year
df.drop('Last Updated', axis=1, inplace=True)
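`pd.to_datetime` infers the date format here; since every entry follows the 'January 7, 2018' pattern, an explicit format string could be passed instead, which is faster and fails loudly on malformed rows (a sketch under that assumption):

```python
import pandas as pd

# Two dates in the dataset's 'Month day, Year' style.
dates = pd.Series(["January 7, 2018", "August 1, 2018"])

# '%B %d, %Y' = full month name, day of month, four-digit year.
parsed = pd.to_datetime(dates, format="%B %d, %Y")

print(parsed.dt.year.tolist())   # [2018, 2018]
print(parsed.dt.month.tolist())  # [1, 8]
```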
df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Current Ver | Android Ver | Updated_Month | Updated_Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 1.0.0 | 4.0.3 and up | 1 | 2018 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2.0.0 | 4.0.3 and up | 1 | 2018 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 1.2.4 | 4.0.3 and up | 8 | 2018 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000 | Free | 0.0 | Teen | Art & Design | Varies with device | 4.2 and up | 6 | 2018 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 1.1 | 4.4 and up | 6 | 2018 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int32  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int32  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Current Ver     10832 non-null  object 
 11  Android Ver     10838 non-null  object 
 12  Updated_Month   10840 non-null  int32  
 13  Updated_Year    10840 non-null  int32  
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
Data Cleaning
null = pd.DataFrame({
    'Null Values': df.isna().sum().sort_values(ascending=False),
    'Percentage Null Values': df.isna().sum().sort_values(ascending=False) / df.shape[0] * 100
})
null
| | Null Values | Percentage Null Values |
|---|---|---|
| Size | 1695 | 15.636531 |
| Rating | 1474 | 13.597786 |
| Current Ver | 8 | 0.073801 |
| Android Ver | 2 | 0.018450 |
| Type | 1 | 0.009225 |
| App | 0 | 0.000000 |
| Category | 0 | 0.000000 |
| Reviews | 0 | 0.000000 |
| Installs | 0 | 0.000000 |
| Price | 0 | 0.000000 |
| Content Rating | 0 | 0.000000 |
| Genres | 0 | 0.000000 |
| Updated_Month | 0 | 0.000000 |
| Updated_Year | 0 | 0.000000 |
null_counts = df.isna().sum().sort_values(ascending=False)/len(df)
plt.figure(figsize=(16,8))
plt.xticks(np.arange(len(null_counts))+0.5,null_counts.index,rotation='vertical')
plt.ylabel('fraction of rows with missing data')
plt.bar(np.arange(len(null_counts)),null_counts)
<BarContainer object of 14 artists>
We have missing values in Size, Rating, Current Ver, Android Ver and Type.
Handling Missing Values
def impute_median(series):
    return series.fillna(series.median())
# Impute missing values in 'Rating' and 'Size' columns with the median
df['Rating'] = df['Rating'].transform(impute_median)
df['Size'] = df['Size'].transform(impute_median)
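A single global median is a blunt instrument; as an alternative sketch (not what this notebook does), the median could be computed per `Category`, which often tracks a column's true distribution more closely:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: one missing rating in each of two categories.
demo = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating":   [4.0,    np.nan, 3.0,     np.nan],
})

# Fill each gap with the median of its own category rather than the global median.
demo["Rating"] = demo.groupby("Category")["Rating"].transform(
    lambda s: s.fillna(s.median())
)

print(demo["Rating"].tolist())  # [4.0, 4.0, 3.0, 3.0]
```

The global median here would be 3.5 for both gaps; the per-category version preserves the difference between the two groups.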
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          10840 non-null  float64
 3   Reviews         10840 non-null  int32  
 4   Size            10840 non-null  float64
 5   Installs        10840 non-null  int32  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Current Ver     10832 non-null  object 
 11  Android Ver     10838 non-null  object 
 12  Updated_Month   10840 non-null  int32  
 13  Updated_Year    10840 non-null  int32  
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
df.isnull().sum()
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    0
Genres            0
Current Ver       8
Android Ver       2
Updated_Month     0
Updated_Year      0
dtype: int64
# Fill missing values in 'Type', 'Current Ver', and 'Android Ver' with the mode
# (plain assignment instead of chained inplace fillna, which pandas deprecates)
df['Type'] = df['Type'].fillna(df['Type'].mode()[0])
df['Current Ver'] = df['Current Ver'].fillna(df['Current Ver'].mode()[0])
df['Android Ver'] = df['Android Ver'].fillna(df['Android Ver'].mode()[0])
df.isnull().sum()
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Current Ver       0
Android Ver       0
Updated_Month     0
Updated_Year      0
dtype: int64
Deleting Duplicated Data
duplicate = df.duplicated()
print(duplicate.sum())
484
df.drop_duplicates(inplace=True)
duplicate = df.duplicated()
print(duplicate.sum())
0
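`drop_duplicates()` with no arguments removes only rows that are identical in every column; the dataset also lists some apps several times with different review counts. One option (a sketch, assuming `Reviews` is already numeric) keeps the listing with the most reviews per app name:

```python
import pandas as pd

# Hypothetical frame: 'Sketch' appears twice with different review counts.
demo = pd.DataFrame({
    "App":     ["Sketch", "Sketch", "Moana"],
    "Reviews": [100, 215644, 967],
})

# Sort so the highest review count comes first, then keep one row per app name.
deduped = (demo.sort_values("Reviews", ascending=False)
               .drop_duplicates(subset="App", keep="first")
               .sort_index())

print(deduped["Reviews"].tolist())  # [215644, 967]
```

The highest review count is usually the most recent scrape of the same listing, which is why it is the natural row to keep.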
Extracting Numerical and Categorical Features
num_features=[col for col in df.columns if df[col].dtype!='O']
num_features
['Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Updated_Month', 'Updated_Year']
cat_features=[col for col in df.columns if df[col].dtype=='O']
cat_features
['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Current Ver', 'Android Ver']
Checking Outliers
sns.boxplot(df["Rating"])
<Axes: ylabel='Rating'>
sns.boxplot(df["Reviews"])
<Axes: ylabel='Reviews'>
sns.boxplot(df["Size"])
<Axes: ylabel='Size'>
sns.boxplot(df["Installs"])
<Axes: ylabel='Installs'>
sns.boxplot(df["Price"])
<Axes: ylabel='Price'>
sns.boxplot(df["Updated_Month"])
<Axes: ylabel='Updated_Month'>
sns.boxplot(df["Updated_Year"])
<Axes: ylabel='Updated_Year'>
# Calculate IQR for each column
Q1_rating = df['Rating'].quantile(0.25)
Q3_rating = df['Rating'].quantile(0.75)
IQR_rating = Q3_rating - Q1_rating
lower_bound_rating = Q1_rating - 1.5 * IQR_rating
upper_bound_rating = Q3_rating + 1.5 * IQR_rating
outliers_rating = df[(df['Rating'] < lower_bound_rating) | (df['Rating'] > upper_bound_rating)]
Q1_reviews = df['Reviews'].quantile(0.25)
Q3_reviews = df['Reviews'].quantile(0.75)
IQR_reviews = Q3_reviews - Q1_reviews
lower_bound_reviews = Q1_reviews - 1.5 * IQR_reviews
upper_bound_reviews = Q3_reviews + 1.5 * IQR_reviews
outliers_reviews = df[(df['Reviews'] < lower_bound_reviews) | (df['Reviews'] > upper_bound_reviews)]
Q1_size = df['Size'].quantile(0.25)
Q3_size = df['Size'].quantile(0.75)
IQR_size = Q3_size - Q1_size
lower_bound_size = Q1_size - 1.5 * IQR_size
upper_bound_size = Q3_size + 1.5 * IQR_size
outliers_size = df[(df['Size'] < lower_bound_size) | (df['Size'] > upper_bound_size)]
Q1_installs = df['Installs'].quantile(0.25)
Q3_installs = df['Installs'].quantile(0.75)
IQR_installs = Q3_installs - Q1_installs
lower_bound_installs = Q1_installs - 1.5 * IQR_installs
upper_bound_installs = Q3_installs + 1.5 * IQR_installs
outliers_installs = df[(df['Installs'] < lower_bound_installs) | (df['Installs'] > upper_bound_installs)]
Q1_price = df['Price'].quantile(0.25)
Q3_price = df['Price'].quantile(0.75)
IQR_price = Q3_price - Q1_price
lower_bound_price = Q1_price - 1.5 * IQR_price
upper_bound_price = Q3_price + 1.5 * IQR_price
outliers_price = df[(df['Price'] < lower_bound_price) | (df['Price'] > upper_bound_price)]
Q1_month = df['Updated_Month'].quantile(0.25)
Q3_month = df['Updated_Month'].quantile(0.75)
IQR_month = Q3_month - Q1_month
lower_bound_month = Q1_month - 1.5 * IQR_month
upper_bound_month = Q3_month + 1.5 * IQR_month
outliers_month = df[(df['Updated_Month'] < lower_bound_month) | (df['Updated_Month'] > upper_bound_month)]
Q1_year = df['Updated_Year'].quantile(0.25)
Q3_year = df['Updated_Year'].quantile(0.75)
IQR_year = Q3_year - Q1_year
lower_bound_year = Q1_year - 1.5 * IQR_year
upper_bound_year = Q3_year + 1.5 * IQR_year
outliers_year = df[(df['Updated_Year'] < lower_bound_year) | (df['Updated_Year'] > upper_bound_year)]
# Concatenate all outliers
all_outliers = pd.concat([outliers_rating, outliers_reviews, outliers_size,
outliers_installs, outliers_price, outliers_month, outliers_year])
# View outliers for all columns
print("Outliers in all columns:")
print(all_outliers)
Outliers in all columns:
App Category Rating Reviews \
15 Learn To Draw Kawaii Characters ART_AND_DESIGN 3.2 55
87 RST - Sale of cars on the PCT AUTO_AND_VEHICLES 3.2 250
159 Cloud of Books BOOKS_AND_REFERENCE 3.3 1862
176 Free Book Reader BOOKS_AND_REFERENCE 3.4 1680
209 Plugin:AOT v5.0 BUSINESS 3.1 4034
... ... ... ... ...
10817 HTC Sense Input - FR TOOLS 4.0 885
10830 News Minecraft.fr NEWS_AND_MAGAZINES 3.8 881
10832 FR Tides WEATHER 3.8 1195
10833 Chemin (fr) BOOKS_AND_REFERENCE 4.8 44
10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.5 114
Size Installs Type Price Content Rating Genres \
15 2.700 5000 Free 0.0 Everyone Art & Design
87 1.100 100000 Free 0.0 Everyone Auto & Vehicles
159 19.000 1000000 Free 0.0 Everyone Books & Reference
176 4.000 100000 Free 0.0 Everyone Books & Reference
209 0.023 100000 Free 0.0 Everyone Business
... ... ... ... ... ... ...
10817 8.000 100000 Free 0.0 Everyone Tools
10830 2.300 100000 Free 0.0 Everyone News & Magazines
10832 0.582 100000 Free 0.0 Everyone Weather
10833 0.619 1000 Free 0.0 Everyone Books & Reference
10839 13.000 1000 Free 0.0 Mature 17+ Books & Reference
Current Ver Android Ver Updated_Month Updated_Year
15 Varies with device 4.2 and up 6 2018
87 1.4 4.0.3 and up 4 2018
159 2.2.5 4.1 and up 4 2018
176 3.05 4.0.3 and up 8 2016
209 3.0.1.11 (Build 311) 2.2 and up 9 2015
... ... ... ... ...
10817 1.0.612928 5.0 and up 10 2015
10830 1.5 1.6 and up 1 2014
10832 6.0 2.1 and up 2 2014
10833 0.8 2.2 and up 3 2014
10839 Varies with device Varies with device 1 2015
[7561 rows x 14 columns]
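The seven near-identical bound computations above follow one pattern; as a sketch (the helper name `iqr_bounds` is hypothetical, not part of the notebook), the Tukey-fence calculation can be factored into a function:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Demo on a small series with one obvious outlier.
s = pd.Series([1, 2, 3, 4, 100])
lo, hi = iqr_bounds(s)
mask = (s < lo) | (s > hi)
print(s[mask].tolist())  # [100]
```

Applied over `num_features`, a dict comprehension such as `{col: iqr_bounds(df[col]) for col in num_features}` would collect every column's bounds in one pass.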
# Assuming 'all_outliers' is the concatenated DataFrame
total_rows = len(df)
total_outliers = len(all_outliers)
# Calculate percentage of outlier rows (note: pd.concat counts a row once per
# column in which it is an outlier, so repeated rows inflate this figure;
# len(all_outliers.drop_duplicates()) would count unique rows instead)
percentage_outliers = (total_outliers / total_rows) * 100
print(f"Percentage of outliers in the concatenated dataset: {percentage_outliers:.2f}%")
Percentage of outliers in the concatenated dataset: 73.01%
# Identify and handle outliers
df_no_outliers = df[
(df['Rating'] >= lower_bound_rating) & (df['Rating'] <= upper_bound_rating) &
(df['Reviews'] >= lower_bound_reviews) & (df['Reviews'] <= upper_bound_reviews) &
(df['Size'] >= lower_bound_size) & (df['Size'] <= upper_bound_size) &
(df['Installs'] >= lower_bound_installs) & (df['Installs'] <= upper_bound_installs) &
(df['Price'] >= lower_bound_price) & (df['Price'] <= upper_bound_price) &
(df['Updated_Month'] >= lower_bound_month) & (df['Updated_Month'] <= upper_bound_month) &
(df['Updated_Year'] >= lower_bound_year) & (df['Updated_Year'] <= upper_bound_year)
]
print(df_no_outliers)
App Category Rating \
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
1 Coloring book moana ART_AND_DESIGN 3.9
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3
5 Paper flowers instructions ART_AND_DESIGN 4.4
6 Smoke Effect Photo Maker - Smoke Editor ART_AND_DESIGN 3.8
... ... ... ...
10834 FR Calculator FAMILY 4.0
10835 FR Forms BUSINESS 4.3
10836 Sya9a Maroc - FR FAMILY 4.5
10837 Fr. Mike Schmitz Audio Teachings FAMILY 5.0
10838 Parkinson Exercices FR MEDICAL 4.3
Reviews Size Installs Type Price Content Rating \
0 159 19.0 10000 Free 0.0 Everyone
1 967 14.0 500000 Free 0.0 Everyone
4 967 2.8 100000 Free 0.0 Everyone
5 167 5.6 50000 Free 0.0 Everyone
6 178 19.0 50000 Free 0.0 Everyone
... ... ... ... ... ... ...
10834 7 2.6 500 Free 0.0 Everyone
10835 0 9.6 10 Free 0.0 Everyone
10836 38 53.0 5000 Free 0.0 Everyone
10837 4 3.6 100 Free 0.0 Everyone
10838 3 9.5 1000 Free 0.0 Everyone
Genres Current Ver Android Ver Updated_Month \
0 Art & Design 1.0.0 4.0.3 and up 1
1 Art & Design;Pretend Play 2.0.0 4.0.3 and up 1
4 Art & Design;Creativity 1.1 4.4 and up 6
5 Art & Design 1.0 2.3 and up 3
6 Art & Design 1.1 4.0.3 and up 4
... ... ... ... ...
10834 Education 1.0.0 4.1 and up 6
10835 Business 1.1.5 4.0 and up 9
10836 Education 1.48 4.1 and up 7
10837 Education 1.0 4.1 and up 7
10838 Medical 1.0 2.2 and up 1
Updated_Year
0 2018
1 2018
4 2018
5 2017
6 2018
... ...
10834 2017
10835 2016
10836 2017
10837 2018
10838 2017
[5467 rows x 14 columns]
# Check outliers for Rating column in df_no_outliers
outliers_rating = df_no_outliers[
(df_no_outliers['Rating'] < lower_bound_rating) | (df_no_outliers['Rating'] > upper_bound_rating)
]
# Print the outliers for Rating column
print("Outliers in Rating column:")
print(outliers_rating[['Rating']])
Outliers in Rating column:
Empty DataFrame
Columns: [Rating]
Index: []
# Check outliers for Reviews column in df_no_outliers
outliers_reviews = df_no_outliers[
(df_no_outliers['Reviews'] < lower_bound_reviews) | (df_no_outliers['Reviews'] > upper_bound_reviews)
]
# Print the outliers for Reviews column
print("Outliers in Reviews column:")
print(outliers_reviews[['Reviews']])
Outliers in Reviews column:
Empty DataFrame
Columns: [Reviews]
Index: []
# Display descriptive statistics for numerical columns after removing outliers
print(df_no_outliers.describe())
Rating Reviews Size Installs Price \
count 5467.000000 5467.000000 5467.000000 5467.000000 5467.0
mean 4.295811 6926.320834 14.655660 266178.044449 0.0
std 0.324270 15542.957188 12.575657 400585.587536 0.0
min 3.500000 0.000000 0.010000 0.000000 0.0
25% 4.100000 9.000000 4.800000 1000.000000 0.0
50% 4.300000 222.000000 12.000000 10000.000000 0.0
75% 4.500000 5100.000000 21.000000 500000.000000 0.0
max 5.000000 112565.000000 56.000000 1000000.000000 0.0
Updated_Month Updated_Year
count 5467.000000 5467.000000
mean 6.156393 2017.585147
std 2.707313 0.658074
min 1.000000 2016.000000
25% 4.000000 2017.000000
50% 7.000000 2018.000000
75% 8.000000 2018.000000
max 12.000000 2018.000000
df['Category'].value_counts()
Category
FAMILY                 1943
GAME                   1121
TOOLS                   842
BUSINESS                427
MEDICAL                 408
PRODUCTIVITY            407
PERSONALIZATION         388
LIFESTYLE               373
COMMUNICATION           366
FINANCE                 360
SPORTS                  351
PHOTOGRAPHY             322
HEALTH_AND_FITNESS      306
SOCIAL                  280
NEWS_AND_MAGAZINES      264
TRAVEL_AND_LOCAL        237
BOOKS_AND_REFERENCE     230
SHOPPING                224
DATING                  196
VIDEO_PLAYERS           175
MAPS_AND_NAVIGATION     137
EDUCATION               130
FOOD_AND_DRINK          124
ENTERTAINMENT           111
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       85
WEATHER                  82
HOUSE_AND_HOME           80
ART_AND_DESIGN           65
EVENTS                   64
PARENTING                60
COMICS                   60
BEAUTY                   53
Name: count, dtype: int64
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(x='Category',data=df)
plt.xticks(rotation=70)
plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
    background_color='black',
    width=1920,
    height=1080
).generate(" ".join(df.Category))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
plt.figure(figsize=(20,15))
sns.boxplot(y='Rating',x='Category',data = df.sort_values('Rating',ascending=False))
plt.xticks(rotation=80)
df['Type'].value_counts()
Type
Free    9591
Paid     765
Name: count, dtype: int64
plt.rcParams['figure.figsize'] = (8,5)
sns.countplot(x='Type',data=df)
plt.xticks(rotation=70)
df["Type"].value_counts().plot.pie(autopct = "%1.1f%%")
# catplot creates its own figure, so size it via height/aspect instead of plt.figure
sns.catplot(y='Rating', x='Type', data=df.sort_values('Rating', ascending=False), kind='boxen', height=8, aspect=1.9)
df['Content Rating'].value_counts()
Content Rating
Everyone           8381
Teen               1146
Mature 17+          447
Everyone 10+        377
Adults only 18+       3
Unrated               2
Name: count, dtype: int64
plt.figure(figsize=(12,8))
sns.boxplot(y='Rating',x='Content Rating',data = df.sort_values('Rating',ascending=False))
plt.xticks(rotation=90)
plt.figure(figsize=(12,8))
sns.barplot(x="Content Rating", y="Installs", hue="Type", data=df)
df['Genres'].value_counts()
Genres
Tools 841
Entertainment 588
Education 527
Business 427
Medical 408
...
Parenting;Brain Games 1
Travel & Local;Action & Adventure 1
Lifestyle;Pretend Play 1
Tools;Education 1
Strategy;Creativity 1
Name: count, Length: 119, dtype: int64
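Entries such as `Parenting;Brain Games` hold two genres separated by `;`, which inflates the number of unique values to 119. Splitting on `;` and exploding before counting gives a cleaner per-genre tally; a sketch on a made-up slice of the column:

```python
import pandas as pd

# Hypothetical slice of the Genres column (values invented for illustration)
genres = pd.Series(['Tools', 'Parenting;Brain Games', 'Tools;Education', 'Education'])

# Explode multi-genre entries so each genre is counted once per app
genre_counts = genres.str.split(';').explode().value_counts()
```

Applied to `df['Genres']`, this collapses the long tail of one-off combined genres into their component genres.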
df['Current Ver'].value_counts()
Current Ver
Varies with device 1310
1.0 802
1.1 260
1.2 177
2.0 149
...
3.18.5 1
1.3.A.2.9 1
9.9.1.1910 1
7.1.34.28 1
2.0.148.0 1
Name: count, Length: 2831, dtype: int64
df['Android Ver'].value_counts()
Android Ver
4.1 and up            2381
4.0.3 and up          1451
4.0 and up            1337
Varies with device    1221
4.4 and up             893
2.3 and up             643
5.0 and up             546
4.2 and up             387
2.3.3 and up           279
2.2 and up             239
3.0 and up             237
4.3 and up             235
2.1 and up             133
1.6 and up             116
6.0 and up              58
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              22
1.5 and up              20
4.4W and up             11
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
4.0.3 - 7.1.1            2
5.0 - 8.0                2
1.0 and up               2
7.0 - 7.1.1              1
4.1 - 7.1.1              1
5.0 - 6.0                1
2.2 - 7.1.1              1
5.0 - 7.1.1              1
Name: count, dtype: int64
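`Android Ver` is stored as free text (`4.0.3 and up`, `Varies with device`), so it cannot be used numerically as-is. One hedged option (the helper below is illustrative, not the notebook's code) extracts the leading major.minor version and maps non-numeric entries to NaN:

```python
import re
import pandas as pd

def min_android_version(s):
    """Extract the leading major.minor version from strings like '4.0.3 and up'."""
    m = re.match(r'(\d+)\.(\d+)', str(s))
    return float(f"{m.group(1)}.{m.group(2)}") if m else float('nan')

ver = pd.Series(['4.1 and up', 'Varies with device', '2.3.3 and up'])
parsed = ver.map(min_android_version)
```

Strings like `4.4W and up` would also reduce cleanly to `4.4` under this rule.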
# Function to create a scatter plot
def scatters(col1, col2):
    plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
    sns.scatterplot(data=df, x=col1, y=col2, hue="Type")
    plt.title(f'Scatter Plot of {col1} vs {col2}')
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()
# Function to create a KDE plot
def kde_plot(feature):
    # Create a FacetGrid for KDE plots using Seaborn
    grid = sns.FacetGrid(df, hue="Type", aspect=2)
    # Map KDE plots for the specified feature
    grid.map(sns.kdeplot, feature)
    # Add a legend to distinguish between categories
    grid.add_legend()
kde_plot('Rating')
kde_plot('Size')
kde_plot('Updated_Month')
kde_plot('Price')
kde_plot('Updated_Year')
scatters('Price', 'Updated_Year')
scatters('Size', 'Rating')
scatters('Size', 'Installs')
scatters('Updated_Month', 'Installs')
scatters('Reviews', 'Rating')
scatters('Rating', 'Price')
df_rating_5 = df[df.Rating == 5.]
print(f'There are {df_rating_5.shape[0]} apps having rating of 5.0')
There are 271 apps having rating of 5.0
sns.histplot(data=df_rating_5, x='Installs', kde=True, bins=50)
plt.title('Distribution of Installs with 5.0 Rating Apps')
plt.show()
Despite their perfect ratings, the majority of these apps have low install counts, so they cannot be considered the best products.
sns.histplot(data=df_rating_5, x='Reviews', kde=True)
plt.title('Distribution of Reviews with 5.0 Rating Apps')
plt.show()
The distribution is right-skewed: most 5.0-rated apps have very few reviews, so these perfect ratings can be misleading.
df_rating_5_cat = df_rating_5['Category'].value_counts().reset_index()
# Create a pie chart
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")
plt.pie(df_rating_5_cat.iloc[:, 1], labels=df_rating_5_cat.iloc[:, 0], autopct='%1.1f%%')
plt.title('Pie chart of App Categories with 5.0 Rating')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
# Show the pie chart
plt.show()
Family, Lifestyle and Medical apps receive the most 5.0 ratings on the Google Play Store, with Family representing about a quarter of the whole.
df_rating_5_type = df_rating_5['Type'].value_counts().reset_index()
# Create a pie chart
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")
# Data for the pie chart
sizes = df_rating_5_type.iloc[:, 1]
labels = df_rating_5_type.iloc[:, 0]
# Pull a slice out by exploding it
explode = (0, 0.1) # Adjust the second value to control the pull-out distance
# Create the pie chart with default colors
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, pctdistance=0.85, explode=explode)
# Draw a circle in the center to make it look like a donut chart
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Equal aspect ratio ensures that pie is drawn as a circle.
plt.axis('equal')
# Title
plt.title('Pie chart of App Types with 5.0 Rating')
# Show the pie chart
plt.show()
Almost 90% of the 5.0-rating apps on the Google Play Store are free.
# Time Series Plot of Last Updates
# Time Series Plot of Last Updates
freq = df['Updated_Year'].value_counts().sort_index()  # sort chronologically before plotting
freq.plot()
plt.xlabel("Year")
plt.ylabel("Number of updates")
plt.title("Time series plot of Last Updates")
Feature Pruning
We decide to prune the following features:
pruned_features = ['App', 'Genres', 'Current Ver', 'Android Ver']
Step 6 | Data Splitting for Modeling
target = 'Rating'
X = df.copy().drop(pruned_features+[target], axis=1)
y = df.copy()[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
le_dict = {}
features_to_encode = X_train.select_dtypes(include=['category', 'object']).columns
for col in features_to_encode:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])   # Fitting and transforming the train data
    X_train[col] = X_train[col].astype('category')  # Converting the label-encoded feature back to categorical dtype
    X_test[col] = le.transform(X_test[col])         # Only transforming the test data
    X_test[col] = X_test[col].astype('category')
    le_dict[col] = le                               # Saving the label encoder for each feature
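One caveat with this scheme: `LabelEncoder.transform` raises a `ValueError` for any category that appears only in the test split. A common defensive pattern (not part of this notebook) maps unseen labels to a sentinel value instead:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder fitted on three training categories
le = LabelEncoder().fit(['FAMILY', 'GAME', 'TOOLS'])

def safe_transform(le, values):
    """Encode known labels; map unseen ones to -1 instead of raising."""
    known = set(le.classes_)
    return np.array([le.transform([v])[0] if v in known else -1 for v in values])

codes = safe_transform(le, ['GAME', 'BEAUTY'])  # 'BEAUTY' was never seen in training
```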
# Converting and adding "Updated_Month" to categorical features
categorical_features = list(features_to_encode) + ['Updated_Month']
X_train['Updated_Month'] = X_train['Updated_Month'].astype('category')
X_test['Updated_Month'] = X_test['Updated_Month'].astype('category')
# Listing numeric features to scale
numeric_features = X_train.select_dtypes(exclude=['category', 'object']).columns
numeric_features
Index(['Reviews', 'Size', 'Installs', 'Price', 'Updated_Year'], dtype='object')
scaler = StandardScaler()
# Fitting and transforming the Training data
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
# X_train = scaler.fit_transform(X_train)
# Only transforming the Test data
X_test[numeric_features] = scaler.transform(X_test[numeric_features])
# X_test = scaler.transform(X_test)
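The manual encode-then-scale steps above can alternatively be expressed as a single scikit-learn `Pipeline` with a `ColumnTransformer`, which enforces the same discipline automatically: the preprocessor is fit only on the training data and merely transforms the test data. A minimal sketch on invented values (column names echo the notebook's; the data is made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    'Category': ['GAME', 'TOOLS', 'GAME', 'FAMILY'],
    'Reviews': [10, 200, 3000, 50],
})
y = [4.0, 4.2, 4.5, 4.1]

pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Category']),  # unseen categories become all-zeros
    ('num', StandardScaler(), ['Reviews']),
])
model = Pipeline([('pre', pre), ('reg', LinearRegression())]).fit(X, y)
preds = model.predict(X)
```

`handle_unknown='ignore'` also sidesteps the unseen-test-label problem that `LabelEncoder.transform` has.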
Step 7 | Modeling
Regression
models = ['Linear', 'KNN', 'Random Forest']
datasets = ['train', 'test']
metrics = ['RMSE', 'MAE', 'R2']
multi_index = pd.MultiIndex.from_product([models, datasets, metrics],
                                         names=['model', 'dataset', 'metric'])
df_metrics_reg = pd.DataFrame(index=multi_index, columns=['value'])
df_metrics_reg
| model | dataset | metric | value |
|---|---|---|---|
| Linear | train | RMSE | NaN |
| Linear | train | MAE | NaN |
| Linear | train | R2 | NaN |
| Linear | test | RMSE | NaN |
| Linear | test | MAE | NaN |
| Linear | test | R2 | NaN |
| KNN | train | RMSE | NaN |
| KNN | train | MAE | NaN |
| KNN | train | R2 | NaN |
| KNN | test | RMSE | NaN |
| KNN | test | MAE | NaN |
| KNN | test | R2 | NaN |
| Random Forest | train | RMSE | NaN |
| Random Forest | train | MAE | NaN |
| Random Forest | train | R2 | NaN |
| Random Forest | test | RMSE | NaN |
| Random Forest | test | MAE | NaN |
| Random Forest | test | R2 | NaN |
lr = LinearRegression()
lr.fit(X_train, y_train)
LinearRegression()
df_metrics_reg.loc['Linear', 'train', 'R2'] = lr.score(X_train, y_train)
df_metrics_reg.loc['Linear', 'test', 'R2'] = lr.score(X_test, y_test)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
df_metrics_reg.loc['Linear', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Linear', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)
df_metrics_reg.loc['Linear', 'train', 'RMSE'] = mean_squared_error(y_train, y_train_pred, squared=False)
df_metrics_reg.loc['Linear', 'test', 'RMSE'] = mean_squared_error(y_test, y_test_pred, squared=False)
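With `squared=False`, `mean_squared_error` returns the root of the mean squared error. A quick sanity check of both metrics on made-up predictions (toy arrays, not the model's actual output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([4.0, 4.5, 3.5, 5.0])
y_pred = np.array([4.2, 4.4, 3.9, 4.6])

mae = mean_absolute_error(y_true, y_pred)        # mean of absolute errors
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root of mean squared error
```

RMSE penalizes the large 0.4-rating misses more heavily than MAE does, which is why the two columns differ in the table below.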
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
KNeighborsRegressor()
df_metrics_reg.loc['KNN', 'train', 'R2'] = knn.score(X_train, y_train)
df_metrics_reg.loc['KNN', 'test', 'R2'] = knn.score(X_test, y_test)
y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)
df_metrics_reg.loc['KNN', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['KNN', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)
df_metrics_reg.loc['KNN', 'train', 'RMSE'] = mean_squared_error(y_train, y_train_pred, squared=False)
df_metrics_reg.loc['KNN', 'test', 'RMSE'] = mean_squared_error(y_test, y_test_pred, squared=False)
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)
RandomForestRegressor(max_depth=2, random_state=0)
df_metrics_reg.loc['Random Forest', 'train', 'R2'] = rf.score(X_train, y_train)
df_metrics_reg.loc['Random Forest', 'test', 'R2'] = rf.score(X_test, y_test)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
df_metrics_reg.loc['Random Forest', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Random Forest', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)
df_metrics_reg.loc['Random Forest', 'train', 'RMSE'] = mean_squared_error(y_train, y_train_pred, squared=False)
df_metrics_reg.loc['Random Forest', 'test', 'RMSE'] = mean_squared_error(y_test, y_test_pred, squared=False)
# Rounding the values
df_metrics_reg['value'] = df_metrics_reg['value'].apply(lambda v: round(v, ndigits=3))
df_metrics_reg
| model | dataset | metric | value |
|---|---|---|---|
| Linear | train | RMSE | 0.478 |
| Linear | train | MAE | 0.319 |
| Linear | train | R2 | 0.023 |
| Linear | test | RMSE | 0.483 |
| Linear | test | MAE | 0.327 |
| Linear | test | R2 | 0.037 |
| KNN | train | RMSE | 0.409 |
| KNN | train | MAE | 0.280 |
| KNN | train | R2 | 0.286 |
| KNN | test | RMSE | 0.510 |
| KNN | test | MAE | 0.349 |
| KNN | test | R2 | -0.072 |
| Random Forest | train | RMSE | 0.468 |
| Random Forest | train | MAE | 0.309 |
| Random Forest | train | R2 | 0.063 |
| Random Forest | test | RMSE | 0.472 |
| Random Forest | test | MAE | 0.314 |
| Random Forest | test | R2 | 0.081 |
data = df_metrics_reg.reset_index()
g = sns.catplot(col='dataset', data=data, kind='bar', x='model', y='value', hue='metric')
# Adding annotations to bars: iterate through axes and label each bar container
for ax in g.axes.ravel():
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')
    ax.margins(y=0.2)
plt.show()
The regression predictions do not hold up well: all three models leave almost all of the variance in Rating unexplained (test R² ≤ 0.08).
We can conclude that the dataset is not well suited to a regression formulation of this problem.
y_train_int = y_train.astype(int)
y_test_int = y_test.astype(int)
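Note that `astype(int)` truncates rather than rounds, so the class labels are effectively `floor(Rating)`: a 4.9-rated app lands in class 4, and class 5 contains only exact 5.0 ratings. A small demonstration on toy values:

```python
import pandas as pd

ratings = pd.Series([4.9, 4.4, 5.0, 3.5])
truncated = ratings.astype(int)        # truncates toward zero: 4.9 -> 4
rounded = ratings.round().astype(int)  # rounds to nearest (ties to even): 4.9 -> 5
```

Whether truncation or rounding is the right binning is a design choice; truncation is what the notebook uses, and it makes class 5 extremely rare.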
# Create a MultiIndex for the DataFrame
models = ['Logistic Regression', 'KNN', 'Random Forest']
datasets = ['train', 'test']
metrics = ['accuracy %', 'precision', 'recall', 'f1']
multi_index = pd.MultiIndex.from_product([models, datasets, metrics],
                                         names=['model', 'dataset', 'metric'])
# Create an empty DataFrame with the MultiIndex
df_metrics_clf = pd.DataFrame(index=multi_index, columns=['value'])
# Check for class imbalance
train_rating_distribution = y_train_int.value_counts(normalize=True)
test_rating_distribution = y_test_int.value_counts(normalize=True)
print("Training Set Rating Distribution:")
print(train_rating_distribution)
Training Set Rating Distribution:
Rating
4    0.789594
3    0.158740
5    0.025350
2    0.020884
1    0.005432
Name: proportion, dtype: float64
print("\nTest Set Rating Distribution:")
print(test_rating_distribution)
Test Set Rating Distribution:
Rating
4    0.771718
3    0.168919
5    0.029440
2    0.025097
1    0.004826
Name: proportion, dtype: float64
# Calculate class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train_int), y=y_train_int)
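The `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * count_class)`, so rare ratings are up-weighted and the dominant class 4 is down-weighted. A tiny worked example with made-up labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([4, 4, 4, 3, 3, 5])  # imbalanced toy labels: 3x class 4, 2x class 3, 1x class 5
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weight_map = dict(zip(np.unique(y), weights))
# e.g. class 5: 6 samples / (3 classes * 1 occurrence) = 2.0
```

This is exactly how the large weight for class 1 (≈36.8) in the Random Forest cell further below arises: class 1 is very rare in the training split.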
# Logistic Regression Classifier
lr_clf = LogisticRegression(class_weight='balanced')
lr_clf.fit(X_train, y_train_int)
LogisticRegression(class_weight='balanced')
# Update the DataFrame for accuracy
df_metrics_clf.loc[('Logistic Regression', 'train', 'accuracy %'), 'value'] = lr_clf.score(X_train, y_train_int) * 100
df_metrics_clf.loc[('Logistic Regression', 'test', 'accuracy %'), 'value'] = lr_clf.score(X_test, y_test_int) * 100
# Precision, Recall, and F1 Score for Logistic Regression
y_train_pred_lr = lr_clf.predict(X_train)
y_test_pred_lr = lr_clf.predict(X_test)
precision_train_lr = precision_score(y_train_int, y_train_pred_lr, average='weighted') * 100
precision_test_lr = precision_score(y_test_int, y_test_pred_lr, average='weighted') * 100
recall_train_lr = recall_score(y_train_int, y_train_pred_lr, average='weighted') * 100
recall_test_lr = recall_score(y_test_int, y_test_pred_lr, average='weighted') * 100
f1_train_lr = f1_score(y_train_int, y_train_pred_lr, average='weighted') * 100
f1_test_lr = f1_score(y_test_int, y_test_pred_lr, average='weighted') * 100
# Update the DataFrame for precision, recall, and F1 score
df_metrics_clf.loc[('Logistic Regression', 'train', 'precision'), 'value'] = precision_train_lr
df_metrics_clf.loc[('Logistic Regression', 'test', 'precision'), 'value'] = precision_test_lr
df_metrics_clf.loc[('Logistic Regression', 'train', 'recall'), 'value'] = recall_train_lr
df_metrics_clf.loc[('Logistic Regression', 'test', 'recall'), 'value'] = recall_test_lr
df_metrics_clf.loc[('Logistic Regression', 'train', 'f1'), 'value'] = f1_train_lr
df_metrics_clf.loc[('Logistic Regression', 'test', 'f1'), 'value'] = f1_test_lr
# KNN Classifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train_int)
KNeighborsClassifier()
# Update the DataFrame for accuracy
df_metrics_clf.loc[('KNN', 'train', 'accuracy %'), 'value'] = knn_clf.score(X_train, y_train_int) * 100
df_metrics_clf.loc[('KNN', 'test', 'accuracy %'), 'value'] = knn_clf.score(X_test, y_test_int) * 100
# Precision, Recall, and F1 Score for KNN
y_train_pred_knn = knn_clf.predict(X_train)
y_test_pred_knn = knn_clf.predict(X_test)
precision_train_knn = precision_score(y_train_int, y_train_pred_knn, average='weighted') * 100
precision_test_knn = precision_score(y_test_int, y_test_pred_knn, average='weighted') * 100
recall_train_knn = recall_score(y_train_int, y_train_pred_knn, average='weighted') * 100
recall_test_knn = recall_score(y_test_int, y_test_pred_knn, average='weighted') * 100
f1_train_knn = f1_score(y_train_int, y_train_pred_knn, average='weighted') * 100
f1_test_knn = f1_score(y_test_int, y_test_pred_knn, average='weighted') * 100
# Update the DataFrame for precision, recall, and F1 score
df_metrics_clf.loc[('KNN', 'train', 'precision'), 'value'] = precision_train_knn
df_metrics_clf.loc[('KNN', 'test', 'precision'), 'value'] = precision_test_knn
df_metrics_clf.loc[('KNN', 'train', 'recall'), 'value'] = recall_train_knn
df_metrics_clf.loc[('KNN', 'test', 'recall'), 'value'] = recall_test_knn
df_metrics_clf.loc[('KNN', 'train', 'f1'), 'value'] = f1_train_knn
df_metrics_clf.loc[('KNN', 'test', 'f1'), 'value'] = f1_test_knn
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Create the RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
# Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
# Fit the grid search to the data
grid_search.fit(X_train, y_train_int)
# Print the best parameters found
print("Best Hyperparameters:", grid_search.best_params_)
# Get the best model
best_rf = grid_search.best_estimator_
# Evaluate on training set
train_predictions = best_rf.predict(X_train)
train_accuracy = accuracy_score(y_train_int, train_predictions)
train_precision = precision_score(y_train_int, train_predictions, average='weighted')
train_recall = recall_score(y_train_int, train_predictions, average='weighted')
train_f1 = f1_score(y_train_int, train_predictions, average='weighted')
# Evaluate on test set
test_predictions = best_rf.predict(X_test)
test_accuracy = accuracy_score(y_test_int, test_predictions)
test_precision = precision_score(y_test_int, test_predictions, average='weighted')
test_recall = recall_score(y_test_int, test_predictions, average='weighted')
test_f1 = f1_score(y_test_int, test_predictions, average='weighted')
print(f'Training Accuracy (Tuned): {train_accuracy}')
print(f'Training Precision (Tuned): {train_precision}')
print(f'Training Recall (Tuned): {train_recall}')
print(f'Training F1 Score (Tuned): {train_f1}')
print(f'Test Accuracy (Tuned): {test_accuracy}')
print(f'Test Precision (Tuned): {test_precision}')
print(f'Test Recall (Tuned): {test_recall}')
print(f'Test F1 Score (Tuned): {test_f1}')
Best Hyperparameters: {'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
Training Accuracy (Tuned): 0.8349830999517142
Training Precision (Tuned): 0.8268175042775675
Training Recall (Tuned): 0.8349830999517142
Training F1 Score (Tuned): 0.7880559561185819
Test Accuracy (Tuned): 0.7765444015444015
Test Precision (Tuned): 0.706450817261628
Test Recall (Tuned): 0.7765444015444015
Test F1 Score (Tuned): 0.6936858020646786
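This grid spans 3 × 3 × 3 × 3 = 81 parameter combinations, each refit 5 times under `cv=5`, i.e. 405 model fits before the final refit. The same pattern in miniature, on synthetic data (all names and values below are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # target depends only on the first feature

# 2 x 2 = 4 combinations, 3 folds each: 12 cross-validation fits
small_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_toy, y_toy)
```

`search.best_estimator_` is then the refit-on-all-data model, just as `best_rf` is above.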
# Random Forest Classifier
rf_clf = RandomForestClassifier(class_weight=dict(zip(np.unique(y_train_int), class_weights)))
rf_clf.fit(X_train, y_train_int)
RandomForestClassifier(class_weight={1: 36.81777777777778, 2: 9.576878612716763,
                                     3: 1.2599239543726235, 4: 0.253294603271671,
                                     5: 7.889523809523809})
# Update the DataFrame for accuracy
df_metrics_clf.loc[('Random Forest', 'train', 'accuracy %'), 'value'] = rf_clf.score(X_train, y_train_int) * 100
df_metrics_clf.loc[('Random Forest', 'test', 'accuracy %'), 'value'] = rf_clf.score(X_test, y_test_int) * 100
# Precision, Recall, and F1 Score for Random Forest
y_train_pred_rf = rf_clf.predict(X_train)
y_test_pred_rf = rf_clf.predict(X_test)
precision_train_rf = precision_score(y_train_int, y_train_pred_rf, average='weighted') * 100
precision_test_rf = precision_score(y_test_int, y_test_pred_rf, average='weighted') * 100
recall_train_rf = recall_score(y_train_int, y_train_pred_rf, average='weighted') * 100
recall_test_rf = recall_score(y_test_int, y_test_pred_rf, average='weighted') * 100
f1_train_rf = f1_score(y_train_int, y_train_pred_rf, average='weighted') * 100
f1_test_rf = f1_score(y_test_int, y_test_pred_rf, average='weighted') * 100
# Update the DataFrame for precision, recall, and F1 score
df_metrics_clf.loc[('Random Forest', 'train', 'precision'), 'value'] = precision_train_rf
df_metrics_clf.loc[('Random Forest', 'test', 'precision'), 'value'] = precision_test_rf
df_metrics_clf.loc[('Random Forest', 'train', 'recall'), 'value'] = recall_train_rf
df_metrics_clf.loc[('Random Forest', 'test', 'recall'), 'value'] = recall_test_rf
df_metrics_clf.loc[('Random Forest', 'train', 'f1'), 'value'] = f1_train_rf
df_metrics_clf.loc[('Random Forest', 'test', 'f1'), 'value'] = f1_test_rf
# Rounding and converting to percentages
df_metrics_clf['value'] = df_metrics_clf['value'].apply(lambda v: round(v, ndigits=2))
df_metrics_clf
| model | dataset | metric | value |
|---|---|---|---|
| Logistic Regression | train | accuracy % | 26.94 |
| Logistic Regression | train | precision | 75.71 |
| Logistic Regression | train | recall | 26.94 |
| Logistic Regression | train | f1 | 36.95 |
| Logistic Regression | test | accuracy % | 27.32 |
| Logistic Regression | test | precision | 74.70 |
| Logistic Regression | test | recall | 27.32 |
| Logistic Regression | test | f1 | 37.86 |
| KNN | train | accuracy % | 80.98 |
| KNN | train | precision | 77.67 |
| KNN | train | recall | 80.98 |
| KNN | train | f1 | 77.38 |
| KNN | test | accuracy % | 74.52 |
| KNN | test | precision | 66.02 |
| KNN | test | recall | 74.52 |
| KNN | test | f1 | 69.30 |
| Random Forest | train | accuracy % | 100.00 |
| Random Forest | train | precision | 100.00 |
| Random Forest | train | recall | 100.00 |
| Random Forest | train | f1 | 100.00 |
| Random Forest | test | accuracy % | 78.09 |
| Random Forest | test | precision | 72.91 |
| Random Forest | test | recall | 78.09 |
| Random Forest | test | f1 | 71.74 |
# Visualize classification metrics
data_clf = df_metrics_clf.reset_index()
g_clf = sns.catplot(col='dataset', data=data_clf, kind='bar', x='model', y='value', hue='metric', legend_out=False)
for ax in g_clf.axes.ravel():
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')
    ax.margins(y=0.2)
plt.show()
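Weighted averages can hide which rating classes are actually being confused; a per-class confusion matrix makes any majority-class bias visible. A sketch on synthetic labels (these are not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic true/predicted rating classes for illustration only
y_true = np.array([4, 4, 4, 3, 3, 5, 4, 3])
y_pred = np.array([4, 4, 3, 3, 4, 4, 4, 3])

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[3, 4, 5])
```

Running the same call on `y_test_int` and the best model's predictions would show how often the rare classes 1, 2, and 5 are swallowed by class 4.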
After comparing with the regression models, it's clear that we get better results by framing the task as classification!
In conclusion, the dataset from Google Play Store apps has been explored and analyzed using various data visualization techniques with the help of Matplotlib, Seaborn and Plotly libraries.
The preliminary analysis, visualization methods and EDA provided insights into the data and helped in understanding the underlying patterns and relationships among the variables.
The analysis of the Google Play Store dataset showed only weak correlations between the rating and most other app attributes, such as size, reviews, and price. The strongest relationship was a moderate positive correlation between the number of installs and the rating, suggesting that higher-rated apps tend to have more installs.
We also observed that free apps tend to have slightly higher ratings than paid apps, and that app size does not appear to have a significant impact on rating.